Abstract. We study the stochastic multi-armed bandit problem when
one knows the value $\mu^{(\star)}$ of an optimal arm, as well as a positive
lower bound on the smallest positive gap $\Delta$. We propose a new randomized
policy that attains a regret uniformly bounded over time in this setting.
We also prove several lower bounds, which show in particular that
bounded regret is not possible if one only knows $\Delta$, and bounded regret
of order $1/\Delta$ is not possible if one only knows $\mu^{(\star)}$.

1. INTRODUCTION

In this paper we investigate the classical stochastic multi-armed bandit problem
introduced by [12] and described as follows: an agent facing $K$ actions (or bandit
arms) selects one arm at every time step until a finite time horizon $n \ge 1$.
Successive pulls of each arm $i \in \{1, \dots, K\}$ yield a sequence of i.i.d.
rewards $Y_1^{(i)}, Y_2^{(i)}, \dots$ according to some unknown distribution $\nu_i$
with expected value $\mu^{(i)}$. Denote by $\star \in \{1, \dots, K\}$ any optimal arm,
defined such that $\mu^{(\star)} = \max_{j=1,\dots,K} \mu^{(j)}$. A policy $I = \{I_t\}$
is a sequence of random variables $I_t \in \{1, \dots, K\}$ indicating which
arm to pull at each time $t = 1, \dots, n$ and such that $I_t$ depends only on
observations strictly anterior to $t$. The performance of a policy $I$ is measured
by its (cumulative) regret at time $n$, which is defined by

$$R_n = n\mu^{(\star)} - \sum_{t=1}^{n} \mathbb{E}\,\mu^{(I_t)}.$$

Observe that if we denote by $T_i(t) = \sum_{s=1}^{t-1} \mathbf{1}\{I_s = i\}$ the number
of times arm $i$ was pulled (strictly) before time $t \ge 2$ and by
$\Delta_i = \mu^{(\star)} - \mu^{(i)}$ the gap between arm $i$ and the optimal arm, then
one can rewrite the regret as $R_n = \sum_{i=1}^{K} \Delta_i \mathbb{E}\,T_i(n+1)$.
This formulation will be used hereafter.
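As a small numerical illustration, the sketch below checks that the two expressions of the regret, $n\mu^{(\star)} - \sum_t \mathbb{E}\,\mu^{(I_t)}$ and $\sum_i \Delta_i \mathbb{E}\,T_i(n+1)$, agree. The arm means and expected pull counts are hypothetical values chosen for the example, not taken from the paper.

```python
def regret_from_counts(means, expected_pulls):
    """Regret written as sum_i Delta_i * E[T_i(n+1)]."""
    best = max(means)
    return sum((best - m) * pulls for m, pulls in zip(means, expected_pulls))


def regret_from_definition(means, expected_pulls):
    """Regret written as n * mu_star minus the expected total mean reward collected."""
    n = sum(expected_pulls)
    return n * max(means) - sum(m * pulls for m, pulls in zip(means, expected_pulls))


# Hypothetical instance: three arms, arm 0 is optimal, horizon n = 100.
means = [0.0, -0.5, -1.0]
expected_pulls = [80.0, 15.0, 5.0]  # assumed values of E[T_i(n+1)]

print(regret_from_counts(means, expected_pulls))      # 0.5 * 15 + 1.0 * 5
print(regret_from_definition(means, expected_pulls))
```

Both functions return the same value because each pull of arm $i$ costs exactly $\Delta_i$ in expectation.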

We refer the reader to [5] for a survey of the extensive literature on this problem 
and its variations. In this paper we investigate a phenomenon that was first 
observed in [8]: with some prior knowledge (in the form of lower bounds) on 
the maximal mean $\mu^{(\star)}$ and the minimal gap $\Delta = \min_{i:\Delta_i>0} \Delta_i$, it is
possible to obtain a regret that is bounded uniformly in $n$, which implies in
particular that the regret does not tend to infinity as the time horizon $n$ tends
to infinity. Note that this result is striking, as the seminal paper [9] indicates
that, if one has no prior knowledge on the distributions, then asymptotically
(in $n$) a regret of order $\log n$ is unavoidable.

1.1 Contributions 

We describe in Section 2 a simple algorithm for the two-armed bandit problem
when one knows the largest expected reward $\mu^{(\star)}$ and the gap $\Delta$. In this
two-armed case, this amounts to knowing $\mu^{(1)}$ and $\mu^{(2)}$ up to a permutation.
We show that the regret of this algorithm is bounded by $\Delta + 16/\Delta$, uniformly
in $n$. The optimality of this bound is assessed in Section 4 where we show that
any agent knowing $\Delta$ and $\mu^{(\star)}$ must incur a regret of order at least $1/\Delta$.
These upper and lower bounds raise the following question: can such bounded
regret be achieved without one of these two pieces of information? It follows
from Theorems 6 and 8 that the answer to this question is negative. Indeed, the
sole knowledge of either $\Delta$ or $\mu^{(\star)}$ leads to a rescaled regret $\Delta R_n$ that is
at least logarithmic in $n$. Interestingly, all these results are fully
non-asymptotic, including the lower bounds.

What if $\Delta$ is not perfectly known but only some $\varepsilon > 0$ such that $\Delta \ge \varepsilon$?
We answer this question in Section 3 in the context of the general $K$-armed
bandit problem. There, we prove an upper bound on $R_n$ when one knows the maximal
mean $\mu^{(\star)}$ together with a positive lower bound $\varepsilon$ on the smallest gap
$\Delta$. Specifically, we design a randomized policy for which

Rn<E {A. + f log(^)}. 

Moreover, it follows from our main lower bound in Theorem 8 that this result
cannot be improved without further assumptions, since for $\varepsilon$ of order
$1/\sqrt{n}$ (that is, no information on the smallest gap) a logarithmic growth in
$n$ is unavoidable for the rescaled regret $\Delta R_n$. However, for $\varepsilon$ of order
$\Delta$ one would expect no dependency on $\varepsilon$ (since at least for $K = 2$ our
policy of Section 2 attains a regret of order $1/\Delta$). To deal with this issue we
propose an improvement of the basic policy for which the term $\log(1/\varepsilon)$ is
replaced by $\log(\Delta_i/\varepsilon) \log\log(1/\varepsilon)$. In particular, if all the gaps
$\Delta_i$ and $\varepsilon$ are of the same order, the logarithmic term becomes a log-log term.

The exploration-exploitation tradeoff is a preponderant paradigm in the bandit
literature. The effects of this tradeoff already appear for the case $K = 2$ in
the form of the $\log n$ term derived in the original paper [9]. Indeed, there exist
simple classes of (two!) problems over which the regret is uniformly bounded
with full information but cannot be bounded uniformly with bandit feedback, see
Theorem 6. Clearly, this tradeoff should become more and more apparent as the
number of arms increases but this is not our main focus. Rather, the combination
of our results sheds light on an interesting phenomenon: the effects of the tradeoff
vanish when both $\Delta$ and $\mu^{(\star)}$ are known but can be seen already when $K = 2$
and either $\Delta$ or $\mu^{(\star)}$ is unknown.
1.2 Related works

The two-armed bandit problem when one knows the distributions of the arms
up to a permutation was first investigated in [8]. The authors observed that in that
case, using a policy based on the sequential likelihood ratio test, one can obtain
a regret uniformly bounded over $n$. Both upper and lower bounds were provided.
This setting was generalized in [7], where the authors considered the general
multi-armed bandit problem when one knows a separating value $\gamma$ between the
largest mean and the other means. In that case they proved the bounded regret
property for a policy based on sequential likelihood ratio tests for $H_0 : \mu > \gamma$
vs. $H_1 : \mu < \gamma$ (assuming exponential distributions to compute the likelihoods).
They also designed a more subtle strategy for the case when only $\mu^{(\star)}$ is known.
In that case too they proved a bounded regret property. The main open problems
left by these works are (i) to understand the limitations of bounded regret, and
(ii) to characterize the exact dependence on the parameters in the regret (when
bounded regret is achievable). In this paper we make progress on both questions.

Regarding the limitations of bounded regret, we prove three finite-time lower
bounds, including a finite-time version of the seminal result of [9]. Ideas similar to
the ones we develop in Theorems 5 and 6 already appeared in [6] but our results
are fully non-asymptotic with the exact dependence on the parameters involved.
Theorem 8 is more innovative. It shows that a logarithmic growth for the rescaled
regret $\Delta R_n$ is unavoidable even if one knows $\mu^{(\star)}$. The proof of this result goes
beyond any previous lower bound for the stochastic multi-armed bandit problem,
including [7, 9], since all of them required distinguishing problems with different
values of $\mu^{(\star)}$ (such as the ones in Theorem 6 for example). As a consequence
of this theorem, we can deduce that the policies with bounded regret derived in
[7, 1] with only the knowledge of $\mu^{(\star)}$ must have a suboptimal dependency in
$1/\Delta$.

The knowledge of $\mu^{(\star)}$ was also exploited in other works. For instance in [13],
the authors showed that knowing $\mu^{(\star)}$ allows for policies with provably better
concentration properties. Their policies are based on sequential likelihood ratio
tests for $H_0 : \mu = \mu^{(\star)}$ vs. $H_1 : \mu < \mu^{(\star)}$ (assuming Gaussian distributions to
compute the likelihoods). To some extent it was to be expected that the knowledge
of $\mu^{(\star)}$ leads to an improved regret as it partially removes the need for exploration:
if one arm has empirical performance close to $\mu^{(\star)}$, one can be confident that this
is the best arm without worrying that it only appears to be the best because we
have not yet explored the other options enough. However, the problem turns out
to be more subtle than this simple argument suggests: one needs more than the
knowledge of $\mu^{(\star)}$ in order to have a bounded regret with optimal scaling in
$1/\Delta$. Indeed, Theorem 8 implies that the sole knowledge of $\mu^{(\star)}$ does not
warrant the bounded property for the rescaled regret $\Delta R_n$.

1.3 Basic assumptions 

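Throughout the paper, we assume that the distributions $\nu_i$ are sub-Gaussian,
that is, $\int e^{\lambda(x - \mu^{(i)})} \nu_i(dx) \le e^{\lambda^2/2}$ for all $\lambda \in \mathbb{R}$. Note that
these include Gaussian distributions with variance less than 1 and distributions
supported on an interval of length less than 2.

We denote by $\hat\mu_{i,s} = \frac{1}{s} \sum_{t=1}^{s} Y_t^{(i)}$ the empirical mean of arm $i$
after $s$ pulls, for $s \ge 1$. Together with a Chernoff bound, it is not hard to see
that the sub-Gaussian assumption implies the following concentration inequality,
valid for any $u > 0$,

(1.1) $$\mathbb{P}\big(\hat\mu_{i,s} - \mu^{(i)} > u\big) \le \exp\Big(-\frac{s u^2}{2}\Big).$$

Since the sub-Gaussian assumption holds for all $\lambda \in \mathbb{R}$, the same bound applies
to the lower deviation $\mu^{(i)} - \hat\mu_{i,s}$.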
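To make the concentration inequality concrete, the following sketch compares a Monte Carlo estimate of the upper tail of the empirical mean with the bound $\exp(-su^2/2)$, reading (1.1) as $\mathbb{P}(\hat\mu_{i,s} - \mu^{(i)} > u) \le \exp(-su^2/2)$. Standard Gaussian rewards, the number of trials and the seed are assumptions made for this illustration only.

```python
import math
import random


def tail_bound(s, u):
    """Sub-Gaussian tail bound exp(-s * u^2 / 2) for the mean of s samples."""
    return math.exp(-s * u * u / 2.0)


def empirical_tail(s, u, mean=0.0, trials=5000, seed=0):
    """Monte Carlo estimate of P(average of s N(mean, 1) samples - mean > u)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        avg = sum(rng.gauss(mean, 1.0) for _ in range(s)) / s
        if avg - mean > u:
            hits += 1
    return hits / trials


# The estimated tail probability should sit below the bound.
print(empirical_tail(10, 0.5), tail_bound(10, 0.5))
```

For unit-variance Gaussian rewards the bound is loose by a constant factor, which is enough for the regret analysis that follows.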

2. THE TWO-ARMED CASE 

In this section we investigate a toy example where $K = 2$ and the agent
knows exactly both $\mu^{(\star)} = 0$ (without loss of generality) and $\Delta$. While somewhat
simplistic, this example offers a convenient framework to lay out the main ideas
for building policies with bounded regret.




Initialization:
(0) For rounds $t \in \{1, 2\}$, select arm $I_t = t$.

For each round $t = 3, 4, \dots$
(1) If $\hat\mu_{i,T_i(t)} > -\Delta/2$ and [...] then select arm $i$, i.e., $I_t = i$.
(2) Otherwise select both arms, i.e., $I_t = 1$ and $I_{t+1} = 2$.



Figure 1. A policy with bounded regret for the two-armed bandit problem. 
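The policy of Figure 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: arms are indexed 0 and 1, the optimal mean is assumed to be 0, and step (1) is assumed to select the arm with the largest empirical mean whenever that mean exceeds $-\Delta/2$ (this tie-breaking rule is an assumption of the sketch). The function `policy1` and the reward callback `pull` are hypothetical names.

```python
def policy1(delta, pull, n):
    """Run a sketch of Policy 1 for n rounds and return the list of pulled arms.

    delta: known gap; pull(i): returns a reward for arm i in {0, 1}.
    Assumes the optimal mean equals 0.
    """
    sums, counts, history = [0.0, 0.0], [0, 0], []

    def play(i):
        sums[i] += pull(i)
        counts[i] += 1
        history.append(i)

    for i in (0, 1):  # step (0): pull each arm once
        if len(history) < n:
            play(i)
    while len(history) < n:
        means = [sums[i] / counts[i] for i in (0, 1)]
        best = max((0, 1), key=lambda i: means[i])
        if means[best] > -delta / 2:  # step (1): an arm clears the threshold
            play(best)
        else:  # step (2): pull both arms, one after the other
            play(0)
            if len(history) < n:
                play(1)
    return history


# Noiseless rewards: arm 0 has mean 0 and arm 1 has mean -1.
print(policy1(1.0, lambda i: [0.0, -1.0][i], 6))
```

With noiseless rewards the optimal arm clears the threshold right after initialization, so the suboptimal arm is pulled only once.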



Theorem 1. Policy 1 has regret bounded as $R_n \le \Delta + 16/\Delta$, uniformly in $n$.

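Proof. Without loss of generality we assume that $1 = \star$ is the optimal arm.
Observe that

$$\{I_t = 2\} \subset \{t = 2\} \cup \{\hat\mu_{2,T_2(t)} > -\Delta/2,\ t \ge 3,\ I_t = 2\} \cup \{\hat\mu_{2,T_2(t)} \le -\Delta/2,\ t \ge 3,\ I_t = 2\}.$$

Summing over $t$ for the second event, we get

(2.2) $$\mathbb{E} \sum_{t=3}^{n} \mathbf{1}\{\hat\mu_{2,T_2(t)} > -\Delta/2,\ I_t = 2\} \le \mathbb{E} \sum_{t=1}^{n} \mathbf{1}\{\hat\mu_{2,t} > -\Delta/2\} \le \sum_{t=1}^{n} \exp(-t\Delta^2/8) \le \frac{8}{\Delta^2}.$$

For the third event we use the definition of the policy to obtain

$$\{\hat\mu_{2,T_2(t)} \le -\Delta/2,\ t \ge 3,\ I_t = 2\} \subset \{\hat\mu_{1,T_1(t-1)} \le -\Delta/2,\ t \ge 3,\ I_{t-1} = 1\}$$

and conclude as in (2.2). □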
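The last step of (2.2) rests on the geometric sum $\sum_{t \ge 1} \exp(-t\Delta^2/8) = 1/(e^{\Delta^2/8} - 1) \le 8/\Delta^2$. The short check below evaluates partial sums for a few hypothetical values of $\Delta$; the function name `wrong_pull_sum` is introduced only for this illustration.

```python
import math


def wrong_pull_sum(delta, n):
    """Partial sum of exp(-t * delta^2 / 8) for t = 1, ..., n."""
    return sum(math.exp(-t * delta ** 2 / 8.0) for t in range(1, n + 1))


for delta in (0.1, 0.5, 1.0, 2.0):
    # Each partial sum stays below 8 / delta^2, whatever the horizon.
    print(delta, wrong_pull_sum(delta, 10000), 8.0 / delta ** 2)
```

Applying this bound to each of the last two events and adding the single initialization pull gives $\Delta(1 + 16/\Delta^2) = \Delta + 16/\Delta$, the bound of Theorem 1.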

This policy has two weaknesses. First, one may pay a big price for misspecifying
the value of $\Delta$. Namely, if one only knows a lower bound $0 < \varepsilon \le \Delta$ and
substitutes $\varepsilon$ for $\Delta$ in Policy 1, then it follows easily that the regret becomes
of order $\Delta/\varepsilon^2$. Furthermore, for essentially the same reason, the trivial
generalization of this algorithm to the $K$-armed case would give a regret of order
$\sum_i \Delta_i/\Delta^2$. In the next section we show how to overcome these two issues
using a new, randomized, policy.

3. A FAMILY OF POLICIES WITH BOUNDED REGRET

In this section we consider the general multi-armed case, when the agent knows
$\mu^{(\star)} = 0$ (without loss of generality) and an $\varepsilon > 0$ such that $\varepsilon \le \Delta$. Akin to
Policy 1, the policy analyzed here sets a threshold at $-\varepsilon/2$ and prescribes to pull
a single arm above this threshold. However, if all arms have their empirical mean
below this threshold, then the policy is more subtle than what was described in
the previous section (where all arms were pulled in round-robin fashion). Here
the policy picks an arm at random, where the probability of selecting arm $i$ is
essentially proportional to $(\hat\mu_{i,T_i(t)})^{-2}$, which is an empirical estimate of
$\Delta_i^{-2}$ since $\mu^{(\star)} = 0$. Policy 2 is slightly more general, as it uses a potential
function $\psi : \mathbb{R}_+ \to \mathbb{R}_+$, and selects arm $i$ with probability inversely
proportional to $\psi(|\hat\mu_{i,T_i(t)}|)$. The natural choice is $\psi(x) = x^2$, but other
choices can lead to improved performance, see Theorem 2 below. Note that we also
analyze the case where $\varepsilon = 0$ (that is, when we have no information on the
smallest gap).



Initialization:
(0) For rounds $t \in \{1, \dots, K\}$, select arm $I_t = t$.

For each round $t = K + 1, K + 2, \dots$
(1) If there exists $i$ such that $\hat\mu_{i,T_i(t)} > -\varepsilon/2$, then select $I_t \in \operatorname{argmax}_{1 \le j \le K} \hat\mu_{j,T_j(t)}$.
(2) Otherwise select randomly an arm according to the following probability distribution:

$$p_{i,t} = \frac{c}{\psi(|\hat\mu_{i,T_i(t)}|)}, \quad \text{where } c = \frac{1}{\sum_{j=1}^{K} 1/\psi(|\hat\mu_{j,T_j(t)}|)}.$$

Figure 2. A family of policies with bounded regret for the $K$-armed bandit problem.
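A minimal Python sketch of the policy of Figure 2 is given below, under the assumptions that arms are indexed from 0, that the optimal mean is 0, and that $\varepsilon > 0$ (so that every empirical mean is bounded away from 0 whenever step (2) is reached). The names `policy2`, `pull` and `rng` are hypothetical and introduced for illustration only.

```python
import random


def policy2(eps, psi, pull, n_arms, n, rng):
    """Run a sketch of Policy 2 for n rounds and return the list of pulled arms.

    eps: positive lower bound on the smallest gap; psi: potential function;
    pull(i): returns a reward for arm i; rng: a random.Random instance.
    """
    sums, counts, history = [0.0] * n_arms, [0] * n_arms, []

    def play(i):
        sums[i] += pull(i)
        counts[i] += 1
        history.append(i)

    for i in range(n_arms):  # step (0): pull each arm once
        if len(history) < n:
            play(i)
    while len(history) < n:
        means = [sums[i] / counts[i] for i in range(n_arms)]
        best = max(range(n_arms), key=lambda i: means[i])
        if means[best] > -eps / 2:  # step (1): exploit the empirical leader
            play(best)
        else:  # step (2): sample arm i with probability c / psi(|mean_i|)
            weights = [1.0 / psi(abs(m)) for m in means]
            play(rng.choices(range(n_arms), weights=weights)[0])
    return history


# Noiseless rewards with means 0, -1, -2 and the natural potential psi(x) = x^2.
print(policy2(1.0, lambda x: x * x, lambda i: [0.0, -1.0, -2.0][i], 3, 8, random.Random(0)))
```

Passing unnormalized weights to `random.choices` plays the role of the normalizing constant $c$, since the weights are rescaled internally to sum to one.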



Theorem 2. Fix $\varepsilon \in (0, 1 \wedge \Delta]$, then Policy 2 associated with the potential
$\psi(x) = x^2$ satisfies for all $n \ge 1$,



(3.3) Rn< {^^ + f 

i:A,>0 ' 

2 



Furthermore for $\varepsilon = 0$, let $v = \mathbb{E}\,(Y^{(\star)})^2$, then the regret is bounded as



i:Ai>0 * 



2 

The dependency in $\varepsilon$ can be reduced by using the potential $\psi(x) = \frac{x^2}{\log(4x/\varepsilon)}$ since
it yields



(3.5) Rrt< Y: {A. + ^^^^|^[3 + loglog(^)]} 



i;Ai>0 
If $\varepsilon$ is of the order of every $\Delta_i$, then Equation (3.5) upper bounds the regret
in $\log\log(1/\Delta_i)/\Delta_i$; on the other hand, using the potential $\psi(x) = x^2$ only
guarantees, under the same assumptions, a bound in $\log(1/\Delta_i)/\Delta_i$.

The result for $\varepsilon = 0$ implies that when one has no information on the smallest
gap, our policy does not obtain bounded regret but it recovers the performance
of UCB [3]. As we shall see in Section 4, it is in fact impossible to obtain bounded
regret scaling in $1/\Delta$ if one only knows $\mu^{(\star)}$.

Theorem 2 is deduced from the following more general regret bound for Policy 2
expressed in terms of the properties of the potential $\psi$.

Theorem 3. Fix $\varepsilon \in [0, \Delta]$ and let $\psi$ be a differentiable and increasing
function $\psi : [\varepsilon/2, +\infty) \to \mathbb{R}_+$. If $\varepsilon > 0$, Policy 2 satisfies for all $n \ge 1$,



' — ~ ax 



(3.6) n,,<j:j^,^- + —[—^l^^ 

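Furthermore for $\varepsilon = 0$ it satisfies

(3.7) $$R_n \le \sum_{i:\Delta_i>0} \Big(\Delta_i + \frac{8}{\Delta_i} + \frac{\Delta_i}{\psi(\Delta_i/2)} \sum_{t=1}^{n} \mathbb{E}\,\psi(|\hat\mu_{\star,t}|)\Big).$$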

Proof. Without loss of generality we assume that $1 = \star$ is the optimal arm.
We decompose the event of a wrong selection into three events:

$$\{I_t = i\} \subset \{t = i\} \cup \{\hat\mu_{i,T_i(t)} > -\Delta_i/2,\ t \ge K+1,\ I_t = i\} \cup \{\hat\mu_{i,T_i(t)} \le -\Delta_i/2,\ t \ge K+1,\ I_t = i\}.$$

Using (2.2) one can easily prove that the cumulative probability of the first two
events is smaller than $1 + 8/\Delta_i^2$. For the third event, it is convenient to define the
random variable $Z \in \{0, 1, 2\}$ that indicates whether the agent plays according
to (0), (1) or (2) in Policy 2. We write the following, using the definition of the
algorithm and the fact that $\psi$ is non-decreasing,

^{T^T^t) ^ ,t>K + l, It = i} = IP{/iJ^(t) < -A./2 , It = i,Z = 2} 

= IE PMli^Jit) < , ^ = 2} = IE < -Ai/2 , Z = 2} 

- i^km ^ < * >K+i}. 



A simple rewriting of time then concludes the proof for the case of $\varepsilon = 0$. We use
the slight abuse of notation $\psi^{-2}(x) := [\psi^{-1}(x)]^2$, and $\psi(\infty) = \lim_{x \to +\infty} \psi(x)$. For
$\varepsilon > 0$ we have
< -./2} < V(|/ir^|)l{/lS'^ < -e/2}

t=i t=i 



t=l 

" fip(po) 



oh"~+/ 2e —dx\ 

s ?* (2) 



t=i 

f>'(/)(oo) 



/V'(e/2) 

+ / ,_2,^, t^a;. 

Making the change of variable $x = \psi(u)$ concludes the proof of Theorem 3. □

Theorem 2 follows from Theorem 3 with specific choices for $\psi$. First, take
$\psi(x) = x^2$, $\varepsilon \in (0, 1]$ and observe that the integral in (3.6) can be computed as

$$\int_{\varepsilon/2}^{+\infty} \frac{4x}{e^{x^2/2} - 1}\,dx = -4 \log\big(1 - e^{-\varepsilon^2/8}\big) \le 8 \log\Big(\frac{3}{\varepsilon}\Big),$$

which gives (3.3). When $\varepsilon = 0$, since $\mathbb{E}\,\psi(|\hat\mu_{\star,t}|) = v/t$, Equation (3.7) directly
gives (3.4).

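Next, we turn to the slightly more sophisticated potential function
$\psi(x) = \frac{x^2}{\log(4x/\varepsilon)}$. Observe that for any $x > 0$,

$$\psi'(x) = \frac{2x}{\log(4x/\varepsilon)} - \frac{x}{\log^2(4x/\varepsilon)} \le \frac{2x}{\log(4x/\varepsilon)}.$$

Therefore, for $\varepsilon \in (0, 1]$, the integral in (3.6) is bounded from above by

$$\int_{\varepsilon/2}^{+\infty} \frac{4x}{\log(4x/\varepsilon)\,[e^{x^2/2} - 1]}\,dx \le \int_{\varepsilon/2}^{1} \frac{8}{x \log(4x/\varepsilon)}\,dx + \int_{1}^{+\infty} 9 e^{-x^2/2}\,dx$$
$$\le 8 \log\log(4/\varepsilon) - 8 \log\log 2 + 4$$
$$\le 8 \log\log(4/\varepsilon) + 7.$$

This concludes the proof of (3.5).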

4. LOWER BOUNDS 

We conclude our study of bounded regret in stochastic multi-armed bandits
with three different lower bounds. For simplicity, we phrase these results for the
simple two-armed case. First we show with Theorem 5 that if one knows both
$\mu^{(\star)}$ and $\Delta$, then the best attainable regret is of order $1/\Delta$, which matches (up
to a numerical constant) the result of Theorem 1. Next we show in Theorem 6
that the sole knowledge of $\Delta$ leads to a lower bound of order $\log(n\Delta^2)/\Delta$. This
theorem implies that the bounds of [2], [4] and [10] exhibit a tight dependence in
$\Delta$ (for the two-armed case), unlike the famous result of [9]. Moreover, compared
to the proof of [9], our approach is (i) much simpler, (ii) non-asymptotic and (iii)
not limited to a certain class of policies. Finally we show in Theorem 8 that
if one only knows $\mu^{(\star)}$ then a regret of order $\log(n)/\Delta$ is unavoidable (for some
value of $\Delta$).

Our proof strategy consists in rephrasing arm selection as a hypothesis testing
problem, and then using well-known lower bounding techniques for the minimax
risk of hypothesis testing. For instance, the proofs of Theorem 5 and Theorem
6 build upon the following result; see [14, Chapter 2] for a proof, or Lemma 7
below with $\lambda$ chosen to be a Dirac mass at 1. Recall that the Kullback-Leibler
divergence between two positive measures $\rho, \rho'$ with $\rho'$ absolutely continuous with
respect to $\rho$, is defined as

$$\mathrm{KL}(\rho, \rho') = \int \log\Big(\frac{d\rho}{d\rho'}\Big)\,d\rho = \mathbb{E}_{X \sim \rho} \log\Big(\frac{d\rho}{d\rho'}(X)\Big).$$

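Lemma 4. Let $\rho_0, \rho_1$ be two probability distributions supported on some set
$\mathcal{X}$, with $\rho_1$ absolutely continuous with respect to $\rho_0$. Then for any measurable
function $\psi : \mathcal{X} \to \{0, 1\}$, one has

$$\mathbb{P}_{X \sim \rho_0}(\psi(X) = 1) + \mathbb{P}_{X \sim \rho_1}(\psi(X) = 0) \ge \frac{1}{2} \exp\big(-\mathrm{KL}(\rho_0, \rho_1)\big).$$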

In this section we denote by $\nu = \nu_1 \otimes \nu_2$ the product distribution that generates
the rewards from $\nu_j$ when pulling arm $j \in \{1, 2\}$. The regret of a policy that
observes such rewards is denoted by $R_n(\nu)$. Finally, we denote by $\mathbb{P}_\nu$ the
probability associated to $\nu$ and by $\mathbb{E}_\nu$ the corresponding expectation.

Hereafter, we favor rewards that are normally distributed because they lead to
simpler calculations of the KL-divergence. However, our lower bounds remain of
the same order for all families of distributions $\{\rho_\mu\}_\mu$ with expected value $\mu$ and
such that $\mathrm{KL}(\rho_\mu, \rho_{\mu'}) \le C(\mu - \mu')^2$ for some absolute constant $C > 0$. This is
the case, for example, of the Bernoulli distribution with parameter $\mu$ as long as
$\mu$ remains bounded away from 0 and 1; see, e.g., [11, Lemma 4.1].
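The quadratic behaviour of the KL-divergence can be checked numerically. The sketch below uses the closed form $\mathrm{KL}(\mathcal{N}(\mu,1), \mathcal{N}(\mu',1)) = (\mu-\mu')^2/2$ and, for Bernoulli distributions, the standard chi-square upper bound $\mathrm{KL}(p, q) \le (p-q)^2/(q(1-q))$, which gives a constant $C$ as soon as $q$ stays away from 0 and 1. The chi-square route is chosen here for illustration and is not necessarily the argument of [11]; the function names are hypothetical.

```python
import math


def kl_gaussian(mu, mu_prime):
    """KL divergence between N(mu, 1) and N(mu_prime, 1)."""
    return (mu - mu_prime) ** 2 / 2.0


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def chi_square_bound(p, q):
    """Upper bound (p - q)^2 / (q (1 - q)) on the Bernoulli KL divergence."""
    return (p - q) ** 2 / (q * (1 - q))


grid = [k / 10 for k in range(1, 10)]  # parameters 0.1, 0.2, ..., 0.9
worst_ratio = max(
    kl_bernoulli(p, q) / (p - q) ** 2 for p in grid for q in grid if p != q
)
print(kl_gaussian(0.0, -1.0), worst_ratio)
```

On this grid the ratio $\mathrm{KL}/(p-q)^2$ stays bounded by $1/(0.1 \times 0.9)$, consistent with the requirement that the parameter remain bounded away from 0 and 1.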

The first lower bound illustrates that when one knows the distributions up to 
a permutation, the best one can hope for is a bounded regret of order 1/A. 

Theorem 5. Let $\nu = \mathcal{N}(0, 1) \otimes \mathcal{N}(-\Delta, 1)$ and $\nu' = \mathcal{N}(-\Delta, 1) \otimes \mathcal{N}(0, 1)$.
Then for any policy, and for every $n \ge 1$,

max {Rn{u),Rn{'^')) > ^ ■ 

Proof. In this proof we assume that the policy has access to t rewards from 
each arm at time step t. Clearly this full information setting is simpler than the 
bandit setting, and thus a lower bound for the former implies one for the latter. 
Using Lemma 4 as well as straightforward computations one obtains 

1 A " 

max(i?„(i/),i?„(z.')) > 2 {Rn{i^) + Rn{i^')) = - ^(P.(/t = 2) + P,,(/t = 1)) 

i=l 

^ T E exp(-KL(.«*, u'^')) = ^ eM-tA') > . 
t=i t=i 
□

The above theorem ensures that the regret bound of Theorem 1 has the correct
dependence in $\Delta$. This is quite surprising as the original bound of [9] indicates
that without the knowledge of $\mu^{(\star)}$ and $\Delta$, one can incur a regret that diverges to
infinity at a logarithmic rate. The next result shows that this logarithmic regret
already appears when one does not know the value of $\mu^{(\star)}$. Thus the knowledge
of $\Delta$ without the knowledge of $\mu^{(\star)}$ is not sufficient to obtain a bounded regret.
Moreover, the following lower bound matches the upper bounds (for the two-armed
case) of [2], [4] and [10], thus proving their optimality.

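Theorem 6. Let $\nu = \delta_0 \otimes \mathcal{N}(-\Delta, 1)$ and $\nu' = \delta_0 \otimes \mathcal{N}(\Delta, 1)$. Then for any
policy, and any $n \ge 1$,

$$\max\big(R_n(\nu), R_n(\nu')\big) \ge \frac{\log(n\Delta^2/2)}{4\Delta}.$$

Proof. First note that

$$\max\big(R_n(\nu), R_n(\nu')\big) \ge R_n(\nu) \ge \Delta\,\mathbb{E}_\nu T_2(n).$$

Furthermore, denoting by $\nu_t$ (respectively $\nu'_t$) the law of the observed rewards up
to time $t$ under $\nu$ (respectively under $\nu'$), and following the same computations
as in the previous proof, one also obtains

$$\max\big(R_n(\nu), R_n(\nu')\big) \ge \frac{\Delta}{4} \sum_{t=1}^{n} \exp\big(-\mathrm{KL}(\nu_t, \nu'_t)\big).$$

Since under $\nu$, arm 1 is uninformative, it follows from basic calculation that

$$\mathrm{KL}(\nu_t, \nu'_t) = 2\Delta^2\,\mathbb{E}_\nu T_2(t).$$

The above three displays yield

$$\max\big(R_n(\nu), R_n(\nu')\big) \ge \frac{\Delta}{2}\Big(\mathbb{E}_\nu T_2(n) + \frac{n}{4} \exp\big(-2\Delta^2\,\mathbb{E}_\nu T_2(n)\big)\Big)$$
$$\ge \frac{\Delta}{2} \min_{x \in [0,n]} \Big(x + \frac{n}{4} \exp(-2\Delta^2 x)\Big)$$
$$\ge \frac{\log(n\Delta^2/2)}{4\Delta}.$$

□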
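The final minimization in the proof of Theorem 6 can be checked numerically. The function $x \mapsto x + \frac{n}{4}\exp(-2\Delta^2 x)$ is minimized at $x^\star = \log(n\Delta^2/2)/(2\Delta^2)$, where it equals $x^\star + 1/(2\Delta^2) \ge x^\star$; multiplying by $\Delta/2$ gives $\log(n\Delta^2/2)/(4\Delta)$. The sketch below compares a grid minimum with this lower bound for a few hypothetical pairs $(n, \Delta)$; the grid resolution is an arbitrary choice.

```python
import math


def objective(x, n, delta):
    """The quantity x + (n / 4) * exp(-2 * delta^2 * x) minimized in the proof."""
    return x + (n / 4.0) * math.exp(-2.0 * delta ** 2 * x)


def grid_min(n, delta, steps=20000):
    """Minimum of the objective over a uniform grid of [0, n]."""
    return min(objective(n * k / steps, n, delta) for k in range(steps + 1))


def lower_bound(n, delta):
    """log(n * delta^2 / 2) / (2 * delta^2), the claimed lower bound on the minimum."""
    return math.log(n * delta ** 2 / 2.0) / (2.0 * delta ** 2)


for n, delta in ((100, 1.0), (1000, 0.5), (10000, 0.1)):
    print(n, delta, grid_min(n, delta), lower_bound(n, delta))
```

A grid minimum can only overestimate the true minimum, so it always stays above the lower bound, with a slack of roughly $1/(2\Delta^2)$ whenever $x^\star$ lies in $[0, n]$.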

Finally we prove that the knowledge of $\mu^{(\star)}$ without the knowledge of $\Delta$ is
not sufficient either to obtain a bounded rescaled regret $\Delta R_n$. This result is more
difficult, and falls within the more general topic of lower bounds for adaptive rates.
First we need to generalize Lemma 4 to deal with both a composite alternative,
and a rescaled risk. The proof of this result is standard and postponed to the
appendix.
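Lemma 7. Let $\rho_0$ and $\rho_\Delta$, $\Delta \in \mathbb{R}$, be probability distributions supported on
some set $\mathcal{X}$, with $\rho_\Delta$ absolutely continuous with respect to $\rho_0$. Let $\lambda$ be a finite
positive measure on $\mathbb{R}$. Then for any measurable function $\psi : \mathcal{X} \to \{0, 1\}$, one
has

$$\mathbb{P}_{X \sim \rho_0}(\psi(X) = 1) + \int \Delta\,\mathbb{P}_{X \sim \rho_\Delta}(\psi(X) = 0)\,d\lambda(\Delta) \ge \frac{1}{C_\lambda} \exp\big(-\mathrm{KL}(\rho_0, \bar\rho)\big),$$

where $\bar\rho$ is the positive measure on $\mathcal{X}$ defined by $\bar\rho = \int \Delta \rho_\Delta\,d\lambda(\Delta)$ and
$C_\lambda = 1 + \int \Delta\,d\lambda(\Delta)$.

Note that $\int \Delta \rho_\Delta\,d\lambda(\Delta)$ is not a probability distribution; however, it is a positive
measure, thus the Kullback-Leibler divergence in the above lemma is well-defined.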

Theorem 8. Let $\nu_0 = \mathcal{N}(0, 1) \otimes \mathcal{N}(-1, 1)$, and $\nu_\Delta = \mathcal{N}(-\Delta, 1) \otimes \mathcal{N}(0, 1)$,
$\Delta \in (0, 1]$. Then for any policy, and any $n \ge 1$,

$$\max\Big(R_n(\nu_0), \sup_{\Delta \in (0,1]} \Delta R_n(\nu_\Delta)\Big) \ge \frac{1}{2} \log(n/139).$$

Theorem 8 can be read as follows: for any policy, and any $n \ge 1$, there exists
$\Delta \in (0, 1]$ and a problem instance with gap $\Delta$ and optimal value $\mu^{(\star)} = 0$ such
that on this problem one has

$$R_n \ge \frac{\log(n/139)}{2\Delta}.$$

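Proof. Similarly to the previous proof we define $\nu_{0,t}$ and $\nu_{\Delta,t}$ as the law of
the observed rewards up to time $t$. Lemma 7 yields

(4.8) $$\max\Big(R_n(\nu_0), \sup_{\Delta \in (0,1]} \Delta R_n(\nu_\Delta)\Big) \ge \frac{1}{2C_\lambda} \sum_{t=1}^{n} \exp\Big(-\mathrm{KL}\Big(\nu_{0,t}, \int \Delta \nu_{\Delta,t}\,d\lambda(\Delta)\Big)\Big).$$

For $\nu \in \{\nu_0, \nu_\Delta\}$, define the average rewards for arm $i \in \{1, 2\}$ by $\mu_\nu^{(i)}$. Therefore,
$\mu_{\nu_0}^{(1)} = \mu_{\nu_\Delta}^{(2)} = 0$, $\mu_{\nu_0}^{(2)} = -1$ and $\mu_{\nu_\Delta}^{(1)} = -\Delta$. Recall that a policy
$\{I_t\}_{t \ge 1}$ taking values in $\{1, 2\}$ generates a sequence of rewards $Y_t^{(I_t)}$, $t \ge 1$,
distributed according to $\nu \in \{\nu_0, \nu_\Delta\}$. The joint density (with respect to the
Lebesgue measure) $d\nu_t$ of $(Y_1^{(I_1)}, \dots, Y_t^{(I_t)}) \in \mathbb{R}^t$, where $\nu \in \{\nu_\Delta, \nu_0\}$,
can be computed easily using the chain rule for conditional densities. It is given by

$$d\nu_t\big(y_1^{(I_1)}, \dots, y_t^{(I_t)}\big) = \prod_{\ell=1}^{t} \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2}\big(y_\ell^{(I_\ell)} - \mu_\nu^{(I_\ell)}\big)^2\Big).$$

Choosing $\nu = \nu_\Delta$ and $\nu = \nu_0$ respectively, it yields

$$\frac{d\nu_{\Delta,t}}{d\nu_{0,t}}\big(y_1^{(I_1)}, \dots, y_t^{(I_t)}\big) = \exp\Big(-\frac{1}{2} \sum_{\ell=1}^{t} \Big[\big(y_\ell^{(I_\ell)} - \mu_{\nu_\Delta}^{(I_\ell)}\big)^2 - \big(y_\ell^{(I_\ell)} - \mu_{\nu_0}^{(I_\ell)}\big)^2\Big]\Big)$$
$$= \exp\Big(-\frac{1}{2} \sum_{\ell : I_\ell = 1} \Big[\big(y_\ell^{(1)} + \Delta\big)^2 - \big(y_\ell^{(1)}\big)^2\Big] - \frac{1}{2} \sum_{\ell : I_\ell = 2} \Big[\big(y_\ell^{(2)}\big)^2 - \big(y_\ell^{(2)} + 1\big)^2\Big]\Big)$$
$$= \exp\Big(-\frac{T^{(1)}}{2}\big(2\Delta\hat\mu^{(1)} + \Delta^2\big) + \frac{T^{(2)}}{2}\big(2\hat\mu^{(2)} + 1\big)\Big),$$

where we denote for simplicity

$$T^{(i)} = T_i(t+1) = \sum_{\ell=1}^{t} \mathbf{1}\{I_\ell = i\} \quad \text{and} \quad \hat\mu^{(i)} = \hat\mu_{i,T^{(i)}} = \frac{1}{T^{(i)}} \sum_{\ell=1}^{t} Y_\ell^{(I_\ell)} \mathbf{1}\{I_\ell = i\}, \quad i \in \{1, 2\}.$$

Dropping the dependency in $(Y_1^{(I_1)}, \dots, Y_t^{(I_t)})$ from the notation, it yields

$$\int \Delta \frac{d\nu_{\Delta,t}}{d\nu_{0,t}}\,d\lambda(\Delta) = \exp\Big(\frac{T^{(2)}}{2}\big(2\hat\mu^{(2)} + 1\big)\Big) \int \Delta \exp\Big(-\frac{T^{(1)}}{2}\big(2\Delta\hat\mu^{(1)} + \Delta^2\big)\Big)\,d\lambda(\Delta),$$

and thus

$$\mathrm{KL}\Big(\nu_{0,t}, \int \Delta \nu_{\Delta,t}\,d\lambda(\Delta)\Big) = \frac{1}{2} \mathbb{E}_{\nu_0} T^{(2)} - \mathbb{E}_{\nu_0} \log\Big(\int \Delta \exp\Big(-\frac{T^{(1)}}{2}\big(2\Delta\hat\mu^{(1)} + \Delta^2\big)\Big)\,d\lambda(\Delta)\Big),$$

where the last line follows from standard computations. Next, it follows from the
Cauchy-Schwarz inequality that the function

$$x \mapsto \log \int \Delta \exp\big(\varphi(\Delta) x\big)\,d\lambda(\Delta)$$

is convex for any function $\varphi$. Together with the Jensen inequality, it yields

$$\mathbb{E}_{\nu_0} \log\Big(\int \Delta \exp\Big(-\frac{T^{(1)}}{2}\big(2\Delta\hat\mu^{(1)} + \Delta^2\big)\Big)\,d\lambda(\Delta)\Big) \ge \log\Big(\int \Delta \exp\Big(-\mathbb{E}_{\nu_0} \frac{T^{(1)}}{2}\big(2\Delta\hat\mu^{(1)} + \Delta^2\big)\Big)\,d\lambda(\Delta)\Big)$$
$$= \log\Big(\int \Delta \exp\Big(-\frac{\mathbb{E}_{\nu_0} T^{(1)}}{2} \Delta^2\Big)\,d\lambda(\Delta)\Big).$$

Define $\tau = \mathbb{E}_{\nu_0} T^{(1)}$ and let $\lambda$ be the uniform distribution on $[0, 1/\sqrt{\tau}]$. Since
$u e^{-u^2/2} \ge u/2$ for $0 \le u \le 1$, it yields

$$\int \Delta \exp\Big(-\frac{\tau}{2} \Delta^2\Big)\,d\lambda(\Delta) = \frac{1}{\sqrt{\tau}} \int_0^1 u \exp(-u^2/2)\,du \ge \frac{1}{4\sqrt{\tau}}.$$

Thus we have proved that

$$\mathrm{KL}\Big(\nu_{0,t}, \int \Delta \nu_{\Delta,t}\,d\lambda\Big) \le \frac{1}{2} \mathbb{E}_{\nu_0} T^{(2)} + \log\Big(4\sqrt{\mathbb{E}_{\nu_0} T^{(1)}}\Big)$$
$$\le \frac{1}{2} \mathbb{E}_{\nu_0} T_2(n) + \frac{1}{2} \log(16n).$$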
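Plugging this into (4.8) one obtains

$$\max\Big(R_n(\nu_0), \sup_{\Delta \in (0,1]} \Delta R_n(\nu_\Delta)\Big) \ge \frac{\sqrt{n}}{8 C_\lambda} \exp\Big(-\frac{1}{2} \mathbb{E}_{\nu_0} T_2(n)\Big) \ge \frac{\sqrt{n}}{16} \exp\Big(-\frac{1}{2} \mathbb{E}_{\nu_0} T_2(n)\Big),$$

where we use the fact that $\tau \ge 1$, which implies $C_\lambda \le 3/2 \le 2$. On the other
hand one also has

$$\max\Big(R_n(\nu_0), \sup_{\Delta \in (0,1]} \Delta R_n(\nu_\Delta)\Big) \ge R_n(\nu_0) \ge \mathbb{E}_{\nu_0} T_2(n).$$

Therefore

$$\max\Big(R_n(\nu_0), \sup_{\Delta \in (0,1]} \Delta R_n(\nu_\Delta)\Big) \ge \min_{x \in [0,n]} \frac{1}{2}\Big(x + \frac{\sqrt{n}}{16} \exp(-x/2)\Big) \ge \frac{1}{2} \log(n/139).$$

□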

Theorems 6 and 8 have important consequences on the exploration-exploitation
tradeoff mentioned in the introduction. Indeed, consider the full information case
where at each round, the agent observes the reward of both arms. In this case,
it is not hard to see that the policy that pulls the arm with the best
average reward has bounded regret of order $1/\Delta$. Therefore, the knowledge of $\Delta$
or $\mu^{(\star)}$ alone does not alleviate the price for exploration. However, when both are
known, it vanishes (see Theorem 1).
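The full information claim can be explored with a short simulation. The sketch below runs the policy that pulls the arm with the best average reward when both rewards are observed at every round, on the pair $\mathcal{N}(0,1)$ and $\mathcal{N}(-\Delta,1)$; the choice of distributions, the tie-breaking rule and the seed are assumptions of this illustration, and a single run is of course only indicative.

```python
import random


def follow_the_leader_regret(delta, n, rng):
    """Regret of one run of follow-the-leader with full information.

    Arm 0 has rewards N(0, 1) and arm 1 has rewards N(-delta, 1).
    """
    sums = [0.0, 0.0]
    regret = 0.0
    for _ in range(n):
        # Pull the empirical leader; ties (including the first round) go to arm 0.
        arm = 0 if sums[0] >= sums[1] else 1
        regret += delta * arm
        # Full information: both rewards are observed, whatever arm was pulled.
        sums[0] += rng.gauss(0.0, 1.0)
        sums[1] += rng.gauss(-delta, 1.0)
    return regret


rng = random.Random(0)
runs = [follow_the_leader_regret(0.5, 2000, rng) for _ in range(20)]
print(sum(runs) / len(runs))
```

Since both arms accumulate one observation per round regardless of the choices made, no exploration is needed, which is why the regret of this policy does not grow with $n$.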

Acknowledgments. We are indebted to Alexander Goldenshluger for bring- 
ing the reference [7] to our attention. 

APPENDIX A: PROOF OF LEMMA 7 

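Throughout the proof, Radon-Nikodym derivatives over $\mathcal{X}$ are taken with
respect to a common but unspecified reference measure. It does not enter our final
result. It follows from Fubini's Theorem that

$$\mathbb{P}_{X \sim \rho_0}(\psi(X) = 1) + \int \Delta\,\mathbb{P}_{X \sim \rho_\Delta}(\psi(X) = 0)\,d\lambda(\Delta) = \int_{\psi=1} d\rho_0 + \int \Big(\int_{\psi=0} \Delta\,d\rho_\Delta\Big)\,d\lambda(\Delta)$$
$$= \int_{\psi=1} d\rho_0 + \int_{\psi=0} d\bar\rho$$
$$= \int_{\psi=1} d\rho_0 + \int_{\psi=0} \frac{d\bar\rho}{d\rho_0}\,d\rho_0.$$

Furthermore the last expression is clearly minimized for $\psi(x) = \mathbf{1}\big\{\frac{d\bar\rho}{d\rho_0}(x) \ge 1\big\}$.
It yields

$$\int_{\psi=1} d\rho_0 + \int_{\psi=0} \frac{d\bar\rho}{d\rho_0}\,d\rho_0 \ge \int_{\frac{d\bar\rho}{d\rho_0} \ge 1} d\rho_0 + \int_{\frac{d\bar\rho}{d\rho_0} < 1} \frac{d\bar\rho}{d\rho_0}\,d\rho_0$$
$$= \int_{d\bar\rho \ge d\rho_0} d\rho_0 + \int_{d\bar\rho < d\rho_0} d\bar\rho$$
$$= \int_{\mathcal{X}} \min(d\rho_0, d\bar\rho).$$

Note that the latter quantity is often referred to as Hellinger affinity and does
not depend on the reference measure on $\mathcal{X}$; see, e.g., [14], Chapter 2. Now using
the Cauchy-Schwarz inequality and the fact that

$$\int \min(d\rho_0, d\bar\rho) + \int \max(d\rho_0, d\bar\rho) = C_\lambda,$$

we get

$$\Big(\int \sqrt{d\bar\rho\,d\rho_0}\Big)^2 = \Big(\int \sqrt{\min(d\bar\rho, d\rho_0) \max(d\bar\rho, d\rho_0)}\Big)^2$$
$$\le \Big(\int \min(d\bar\rho, d\rho_0)\Big)\Big(\int \max(d\bar\rho, d\rho_0)\Big)$$
$$\le C_\lambda \int_{\mathcal{X}} \min(d\bar\rho, d\rho_0).$$

The above three displays together yield

$$\mathbb{P}_{X \sim \rho_0}(\psi(X) = 1) + \int \Delta\,\mathbb{P}_{X \sim \rho_\Delta}(\psi(X) = 0)\,d\lambda(\Delta) \ge \frac{1}{C_\lambda}\Big(\int \sqrt{d\bar\rho\,d\rho_0}\Big)^2.$$

To complete the proof, observe that the Jensen inequality yields

$$\Big(\int \sqrt{d\bar\rho\,d\rho_0}\Big)^2 = \Big(\int \sqrt{\frac{d\bar\rho}{d\rho_0}}\,d\rho_0\Big)^2 = \exp\Big(2 \log \int \sqrt{\frac{d\bar\rho}{d\rho_0}}\,d\rho_0\Big)$$
$$\ge \exp\Big(2 \int \log\Big(\sqrt{\frac{d\bar\rho}{d\rho_0}}\Big)\,d\rho_0\Big)$$
$$= \exp[-\mathrm{KL}(\rho_0, \bar\rho)].$$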